Skip to content

Add backend-aware test parameter scaling for pure Python CI#129

Merged
igerber merged 3 commits intomainfrom
review-ci-tests
Feb 8, 2026
Merged

Add backend-aware test parameter scaling for pure Python CI#129
igerber merged 3 commits intomainfrom
review-ci-tests

Conversation

@igerber
Copy link
Owner

@igerber igerber commented Feb 1, 2026

Summary

  • Add CIParams class and ci_params session-scoped pytest fixture in conftest.py that detects pure Python mode and scales down bootstrap iterations and TROP grid sizes
  • Update ~89 test functions across 7 test files to use ci_params.bootstrap(n) for bootstrap iteration counts and ci_params.grid(values) for TROP lambda grids
  • Make 11 hardcoded assertion values dynamic to match scaled parameters
  • Leave API-only tests (no .fit() calls) and warning-threshold tests unchanged

Methodology references (required if estimator / math changes)

  • N/A - no methodology changes. Only test parameter scaling; same code paths exercised with fewer iterations in pure Python mode.

Validation

  • Tests added/updated: tests/conftest.py, tests/test_trop.py, tests/test_staggered.py, tests/test_wild_bootstrap.py, tests/test_sun_abraham.py, tests/test_estimators.py, tests/test_methodology_callaway.py, tests/test_methodology_did.py
  • All 387+ tests across modified files pass with Rust backend
  • With Rust backend, all values pass through unchanged (zero behavior change)

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Introduce ci_params session-scoped fixture that scales bootstrap iterations
and TROP grid sizes when running without Rust backend, reducing CI time
while preserving code path coverage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Feb 1, 2026

Overall assessment: ⚠️ Needs changes

Executive summary:

  • P1: The TROP paper-conformance test now accepts as few as 11 successful bootstrap draws even when n_boot=30, weakening the regression guard for stratified sampling.
  • P2: Bootstrap scaling in pure Python mode reduces methodology convergence tests to ~35 iterations, which can make SE-convergence assertions noisy and less meaningful.
  • No estimator or inference logic changes detected in this diff; methodology registry requirements appear unchanged.
  • AGENTS.md not found in repo; proceeded without additional agent instructions.

Methodology

  • Severity: P1 | Impact: The paper-conformance test for stratified bootstrap now allows as few as 11 successful draws even when n_boot=30, which weakens detection of regressions in Algorithm 3’s stratified sampling behavior. | Concrete fix: Make the threshold proportional to n_boot (e.g., >= max(11, int(0.67 * n_boot)) or >= n_boot - 5) so Rust-backend runs keep the stronger expectation while pure Python scaling remains feasible. tests/test_trop.py:L1615-L1669
  • Severity: P3 | Impact: No estimator/assumption/inference code paths were changed in this diff; registry requirements are unaffected. | Concrete fix: None. docs/methodology/REGISTRY.md:L1-L40

Code Quality

  • Severity: P3 | Impact: No code-quality issues identified in the changed test/fixture code beyond findings above. | Concrete fix: None.

Performance

  • Severity: P3 | Impact: Changes are test-only performance optimizations; no production performance impact observed. | Concrete fix: None.

Maintainability

  • Severity: P3 | Impact: No maintainability issues introduced beyond findings above. | Concrete fix: None.

Tech Debt

  • Severity: P3 | Impact: No new tech-debt concerns introduced in this diff. | Concrete fix: None.

Security

  • Severity: P3 | Impact: No security-sensitive changes or secrets observed in the diff. | Concrete fix: None.

Documentation/Tests

  • Severity: P2 | Impact: Methodology convergence tests now use ci_params.bootstrap(499), which scales to ~35 iterations in pure Python mode; this can make the 20–25% SE-convergence assertions noisy and reduce methodological verification strength. | Concrete fix: Exempt these convergence tests from scaling (keep high n_boot), or conditionally relax/skip when n_boot falls below a minimum threshold. tests/conftest.py:L99-L113, tests/test_methodology_callaway.py:L803-L839

…convergence tests

The sqrt(n)*1.6 scaling reduced n=499 to 35 iterations — too aggressive
for SE convergence tests. Add min_n keyword to CIParams.bootstrap() so
these tests can request a higher floor. Also make TROP stratified bootstrap
threshold proportional to n_boot (restoring Rust-mode behavior from main).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 7, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Feb 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 647841118c89a4dc50179c53bd8e73e0b203c6d8


Overall assessment: ⚠️ Needs changes

Executive summary

  • No estimator/math/identification logic changes in the diff; methodology registry alignment remains unchanged.
  • ci_params fixture centralizes backend-aware scaling for bootstrap and TROP grids, improving CI performance.
  • Potential pure‑Python CI failure: stratified bootstrap test now requires 100% success when scaled to 11 draws.

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P1 — Impact: In pure‑Python mode ci_params.bootstrap(30) scales to 11, and the new threshold min_successes = max(11, int(0.67 * n_boot)) requires all 11 bootstrap draws to succeed, despite the test’s own comment (“Most iterations succeed”). If any draw fails (e.g., solver non‑convergence), this becomes a deterministic CI failure in Python backend. — Concrete fix: Remove the hard 11‑draw floor or cap by a fraction only (e.g., min_successes = max(3, math.ceil(0.67 * n_boot))), or set a higher min_n for this test so the 67% threshold remains meaningful. (tests/test_trop.py:L1650-L1671)

… Python CI

In pure Python mode (n_boot=11), max(11, int(0.67*11)) required 100%
bootstrap success rate. Lowering floor to 5 gives max(5,7)=7 (~64%),
consistent with Rust mode's 67% threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 7, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Feb 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: cf37b5ae5ad7e45ab14548c2c3a15b8983da5dca


Overall assessment: ✅ Looks good

Executive summary

  • No estimator, weighting, variance/SE, or identification logic changes in this diff; Methodology Registry impact is nil.
  • CIParams introduces backend-aware scaling; most bootstrap/grid usage now adapts to pure-Python CI.
  • Minor follow-ups: one SDID test still bypasses scaling; CIParams.grid lacks direct unit coverage; one docstring still claims fixed 499 iterations.

Methodology

  • No findings (tests-only changes; no estimator/math changes to cross-check against docs/methodology/REGISTRY.md).

Code Quality

  • No findings.

Performance

  • P3 — Impact: one .fit() test still uses a fixed n_bootstrap=50, so pure-Python CI won’t get the intended scaling speedup. Fix: inject ci_params and use ci_params.bootstrap(50) unless this test must lock to 50 for a warning threshold. Location: tests/test_estimators.py:L3363-L3371.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P3 — Impact: CIParams.grid behavior is untested; future refactors could silently change grid scaling. Fix: add unit tests covering len(values) <= 3 passthrough and len(values) > 3 first/middle/last selection. Location: tests/conftest.py:L118-L123.
  • P3 — Impact: docstring claims a fixed “499 bootstrap iterations,” but pure-Python mode now scales down (e.g., to 199), which can mislead reviewers. Fix: update text to “up to 499” or mention backend-aware scaling. Location: tests/test_methodology_callaway.py:L803-L815.

@igerber igerber merged commit 4d128ee into main Feb 8, 2026
7 checks passed
@igerber igerber deleted the review-ci-tests branch February 8, 2026 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant